A post about using apache arrow and efficiently visualizing in R
Almost everyone has heard of the NYC taxi data at this point, in its current form it is about 24 columns and just under 1.7 billion rows. Each row represents a ride that occurred somewhere between 2009 and 2022. The important columns in this visualization are
pickup_longitude (double): Longitude data for the pickup locationpickup_latitude (double): Latitude data for the pickup locationdropoff_longitude (double): Longitude data for the dropoff locationdropoff_latitude (double): Latitude data for the dropoff locationThere are the libraries used for tranforming and visualizing.
This transfers the nearly 70 gigabytes of taxi data to my computer.
copy_files(
from = s3_bucket("ursa-labs-taxi-data-v2"),
to = "~/Downloads/nyc-taxi"
)
The datasets can subsequently be opened as such.
nyc_taxi_tiny <- open_dataset("~/Downloads/nyc-taxi-tiny/")
nyc_taxi <- open_dataset("~/Downloads/nyc-taxi/")
glimpse(nyc_taxi_tiny)
Let’s check out nyc_pickups
glimpse(nyc_pickups)
This goes pretty quick now
x0 <- -74.05 # minimum longitude to plot
y0 <- 40.6 # minimum latitude to plot
span <- 0.3 # size of the lat/long window to plot
tic()
pic <- ggplot(nyc_pickups) +
geom_point(
aes(pickup_longitude, pickup_latitude),
size = .2,
stroke = 0,
colour = "#800020"
) +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0)) +
theme_void() +
coord_equal(
xlim = x0 + c(0, span),
ylim = y0 + c(0, span)
)
pic
toc()
tic()
pixels <- 4000
pickup <- nyc_taxi |>
dplyr::filter(
!is.na(pickup_longitude),
!is.na(pickup_latitude),
pickup_longitude > x0,
pickup_longitude < x0 + span,
pickup_latitude > y0,
pickup_latitude < y0 + span
) |>
dplyr::mutate(
unit_scaled_x = (pickup_longitude - x0) / span,
unit_scaled_y = (pickup_latitude - y0) / span,
x = as.integer(round(pixels * unit_scaled_x)),
y = as.integer(round(pixels * unit_scaled_y))
) |>
dplyr::group_by(x, y) |>
dplyr::summarise(pickup = n()) |>
# dplyr::count(x, y, name = "pickup") |>
dplyr::collect()
toc()
My laptop solves this in 27.51 seconds, again lets take a look at the resulting dataframe.
glimpse(pickup)
tic()
grid <- expand_grid(x = 1:pixels, y = 1:pixels) |>
left_join(pickup, by = c("x", "y")) |>
mutate(pickup = replace_na(pickup, 0))
toc()
knitr::include_graphics(here::here("imgs/taxi_viz_billion_rows.png"))